7.4 Compression
that the probabilities are respectively $\frac{1}{2}$, $\frac{1}{4}$, $\frac{1}{8}$, and $\frac{1}{8}$. Then, from Eq. (6.5) we determine $I = 1.75$ bits per symbol, so we should be able to encode the message (whose relative entropy is $\frac{7}{8}$ and hence redundancy $R$ is $\frac{1}{8}$) such that a smaller channel will suffice to send it. The following code may be used:$^{8}$
A   0
B   10
C   110
D   111.
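The figure $I = 1.75$ bits per symbol can be checked numerically; the following is a minimal Python sketch (the variable names are illustrative, not from the text):

```python
from math import log2

# Probabilities of the four symbols of the source (illustrative names).
probs = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}

# Information per symbol, I = -sum p_i log2 p_i, as in Eq. (6.5).
I = -sum(p * log2(p) for p in probs.values())
print(I)  # 1.75 bits per symbol
```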
The average number of binary digits used in encoding a sequence of $N$ symbols will be $N(\frac{1}{2} \times 1 + \frac{1}{4} \times 2 + \frac{2}{8} \times 3) = \frac{7}{4}N$. The digits 0 and 1 can be seen to have equal probabilities; hence, $I$ for the coded sequence is 1 bit/symbol, equivalent to 1.75 binary symbols per original letter. The binary sequence can be translated back into the original alphabet (taking the binary digits in pairs) by the transformation
00   A
01   B
10   C
11   D.
The compression ratio of this process is $\frac{7}{8}$. Note, however, that there is no general method for finding the optimal coding.
Problem. Using the above coding, show that the 16-letter message “ABBAAADABACCDAAB” can be sent using only 14 letters.
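The round trip the problem asks for can be sketched directly from the two tables above (Python; the names are illustrative, not part of the text):

```python
# Encode the 16-letter message with the prefix code above, then regroup
# the binary digits in pairs back into the A-D alphabet.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}
pairs = {"00": "A", "01": "B", "10": "C", "11": "D"}

message = "ABBAAADABACCDAAB"                # 16 letters
bits = "".join(code[ch] for ch in message)  # 28 binary digits

# 28 digits -> 14 pairs -> 14 letters of the original alphabet.
sent = "".join(pairs[bits[i:i + 2]] for i in range(0, len(bits), 2))
print(len(message), len(bits), len(sent))   # 16 28 14
```

Regrouping the 28 binary digits in pairs yields exactly 14 letters, consistent with the compression ratio of $\frac{7}{8}$ quoted above.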
The Shannon technique requires a long delay between receiving symbols for
encoding and the actual encoding, in order to accumulate sufficiently accurate indi-
vidual symbol transmission probabilities. The entire message is then encoded. This
is, of course, a highly impractical procedure. Mandelbrot (1952) has devised a proce-
dure whereby messages are encoded word by word. In this case the word delimiters
(e.g., spaces in English text) play a crucial rôle. From Shannon’s viewpoint, such a
code is necessarily redundant, but on the other hand, an error in a single word renders
only that word unintelligible, not the whole message. It also avoids the necessity for
a long delay before coding can begin.
The Mandelbrot coding scheme has interesting statistical properties. One may presume that the encoder seeks to minimize the cost of conveying a certain amount of information using the collection of words that are at his disposal. If $p_i$ is the probability of selecting and transmitting the $i$th word, then the mean information per symbol contained in the message is, as before, $-\sum p_i \log p_i$. We may suppose that the cost of transmitting a selected word is proportional to its length. If $c_i$ is the cost of transmitting the $i$th word, then the average cost per word is $\sum p_i c_i$. Minimizing this average cost over the probability distribution while keeping the total information constant (using Lagrange's method of undetermined multipliers) yields

$$p_i = C e^{-D c_i}, \qquad (7.4)$$
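For completeness, one way the constrained minimization can be carried out is sketched below (the multipliers $\lambda$ and $\mu$ are introduced here for illustration):

```latex
% Minimize \sum_i p_i c_i subject to -\sum_i p_i \log p_i = H_0
% and \sum_i p_i = 1, via Lagrange multipliers \lambda and \mu.
\begin{align*}
\mathcal{L} &= \sum_i p_i c_i
  + \lambda \Bigl( \sum_i p_i \log p_i + H_0 \Bigr)
  + \mu \Bigl( \sum_i p_i - 1 \Bigr), \\
\frac{\partial \mathcal{L}}{\partial p_i}
  &= c_i + \lambda (\log p_i + 1) + \mu = 0
  \quad\Longrightarrow\quad
  \log p_i = -\frac{c_i + \lambda + \mu}{\lambda},
\end{align*}
```

which is Eq. (7.4) with $C = e^{-(\lambda+\mu)/\lambda}$ and $D = 1/\lambda$.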
$^{8}$ Elaborated by D. A. Huffman.